| Final Project |
| Baseball Statistics |
| [Fall 2015] DataVisualizations, Cannata |
| 12/7/15 |
Brian Bargas, German Martinez, Chris Pena
Our Data was obtained from the Sean Lahman Online Baseball Database. It contains complete batting and pitching statistics from 1871 to 2014, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. Specifically we took the master, hall of fame, and xxx cvs.
Once we had this csv chosen we reformatted them using the ETL.R scripts then imported them into our oracle database.
The csv we used were; Batting Hall of Fame Master Salary Team TeamFrancise
source("~/Documents/Academic/CompSci/Data/FinalProject/DV_FinalProject/00 Doc/filesummary.R", echo = TRUE)
##
## > require(tidyr)
##
## > require(dplyr)
##
## > require(ggplot2)
##
## > setwd("~/Documents/Academic/CompSci/Data/FinalProject/DV_FinalProject/01 Data")
##
## > file_path <- "Batting.reformatted.csv"
##
## > df <- read.csv(file_path, stringsAsFactors = FALSE)
##
## > summary(df)
## playerID yearID stint teamID
## Length:99846 Min. :1871 Min. :1.000 Length:99846
## Class :character 1st Qu.:1932 1st Qu.:1.000 Class :character
## Mode :character Median :1971 Median :1.000 Mode :character
## Mean :1963 Mean :1.077
## 3rd Qu.:1996 3rd Qu.:1.000
## Max. :2014 Max. :5.000
##
## lgID G AB R
## Mode:logical Min. : 0.00 Min. : 0.0 Min. : 0.00
## NA's:99846 1st Qu.: 13.00 1st Qu.: 7.0 1st Qu.: 0.00
## Median : 35.00 Median : 58.0 Median : 5.00
## Mean : 51.46 Mean :150.6 Mean : 19.98
## 3rd Qu.: 81.00 3rd Qu.:252.0 3rd Qu.: 30.00
## Max. :165.00 Max. :716.0 Max. :192.00
## NA's :5149 NA's :5149
## H X2B X3B HR
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 1.00 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 11.00 Median : 2.000 Median : 0.000 Median : 0.000
## Mean : 39.43 Mean : 6.654 Mean : 1.385 Mean : 2.944
## 3rd Qu.: 64.00 3rd Qu.:10.000 3rd Qu.: 2.000 3rd Qu.: 3.000
## Max. :262.00 Max. :67.000 Max. :36.000 Max. :73.000
## NA's :5149 NA's :5149 NA's :5149 NA's :5149
## RBI SB CS BB
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.00
## Median : 4.00 Median : 0.000 Median : 0.000 Median : 3.00
## Mean : 18.04 Mean : 3.182 Mean : 1.337 Mean : 13.88
## 3rd Qu.: 27.00 3rd Qu.: 2.000 3rd Qu.: 1.000 3rd Qu.: 20.00
## Max. :191.00 Max. :138.000 Max. :42.000 Max. :232.00
## NA's :5573 NA's :6449 NA's :28603 NA's :5149
## SO IBB HBP SH
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 10.00 Median : 0.00 Median : 0.000 Median : 1.000
## Mean : 21.57 Mean : 1.23 Mean : 1.114 Mean : 2.486
## 3rd Qu.: 30.00 3rd Qu.: 1.00 3rd Qu.: 1.000 3rd Qu.: 3.000
## Max. :223.00 Max. :120.00 Max. :51.000 Max. :67.000
## NA's :12987 NA's :41712 NA's :7959 NA's :11487
## SF GIDP
## Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 1.000
## Mean : 1.16 Mean : 3.225
## 3rd Qu.: 2.00 3rd Qu.: 5.000
## Max. :19.00 Max. :36.000
## NA's :41181 NA's :31257
##
## > file_path <- "HallOfFame.reformatted.csv"
##
## > df2 <- read.csv(file_path, stringsAsFactors = FALSE)
##
## > summary(df2)
## playerID yearid votedBy ballots
## Length:4088 Min. :1936 Length:4088 Min. : 78.0
## Class :character 1st Qu.:1950 Class :character 1st Qu.:226.0
## Mode :character Median :1964 Mode :character Median :274.0
## Mean :1969 Mean :319.7
## 3rd Qu.:1986 3rd Qu.:425.0
## Max. :2015 Max. :581.0
## NA's :193
## needed votes inducted category
## Min. : 59.0 Min. : 0.00 Length:4088 Length:4088
## 1st Qu.:175.0 1st Qu.: 2.00 Class :character Class :character
## Median :206.0 Median : 10.00 Mode :character Mode :character
## Mean :243.3 Mean : 50.52
## 3rd Qu.:321.0 3rd Qu.: 63.00
## Max. :436.0 Max. :555.00
## NA's :350 NA's :193
## needed_note
## Length:4088
## Class :character
## Mode :character
##
##
##
##
##
## > file_path <- "Master.reformatted.csv"
##
## > df3 <- read.csv(file_path, stringsAsFactors = FALSE)
##
## > summary(df3)
## playerID birthYear birthMonth birthDay
## Length:18589 Min. :1820 Min. : 1.000 Min. : 1.00
## Class :character 1st Qu.:1894 1st Qu.: 4.000 1st Qu.: 8.00
## Mode :character Median :1935 Median : 7.000 Median :16.00
## Mean :1930 Mean : 6.626 Mean :15.62
## 3rd Qu.:1967 3rd Qu.:10.000 3rd Qu.:23.00
## Max. :1994 Max. :12.000 Max. :31.00
## NA's :145 NA's :315 NA's :472
## birthCountry birthState birthCity deathYear
## Length:18589 Length:18589 Length:18589 Min. :1872
## Class :character Class :character Class :character 1st Qu.:1941
## Mode :character Mode :character Mode :character Median :1966
## Mean :1963
## 3rd Qu.:1988
## Max. :2014
## NA's :9364
## deathMonth deathDay deathCountry deathState
## Min. : 1.000 Min. : 1.00 Length:18589 Length:18589
## 1st Qu.: 3.000 1st Qu.: 8.00 Class :character Class :character
## Median : 6.000 Median :15.00 Mode :character Mode :character
## Mean : 6.486 Mean :15.55
## 3rd Qu.:10.000 3rd Qu.:23.00
## Max. :12.000 Max. :31.00
## NA's :9365 NA's :9366
## deathCity nameFirst nameLast
## Length:18589 Length:18589 Length:18589
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## nameGiven weight height bats
## Length:18589 Min. : 65.0 Min. :43.00 Length:18589
## Class :character 1st Qu.:170.0 1st Qu.:71.00 Class :character
## Mode :character Median :185.0 Median :72.00 Mode :character
## Mean :185.6 Mean :72.24
## 3rd Qu.:197.0 3rd Qu.:74.00
## Max. :320.0 Max. :83.00
## NA's :872 NA's :809
## throws debut finalGame
## Length:18589 Length:18589 Length:18589
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## retroID bbrefID
## Length:18589 Length:18589
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
##
## > file_path <- "Salaries.csv"
##
## > df4 <- read.csv(file_path, stringsAsFactors = FALSE)
##
## > summary(df4)
## yearID teamID lgID playerID
## Min. :1985 Length:24758 Length:24758 Length:24758
## 1st Qu.:1993 Class :character Class :character Class :character
## Median :2000 Mode :character Mode :character Mode :character
## Mean :2000
## 3rd Qu.:2007
## Max. :2014
## salary
## Min. : 0
## 1st Qu.: 260000
## Median : 525000
## Mean : 1932905
## 3rd Qu.: 2199643
## Max. :33000000
##
## > file_path <- "Teams.reformatted.csv"
##
## > df5 <- read.csv(file_path, stringsAsFactors = FALSE)
##
## > summary(df5)
## yearID lgID teamID franchID
## Min. :1871 Length:2775 Length:2775 Length:2775
## 1st Qu.:1919 Class :character Class :character Class :character
## Median :1962 Mode :character Mode :character Mode :character
## Mean :1954
## 3rd Qu.:1991
## Max. :2014
##
## divID Rank G Ghome
## Length:2775 Min. : 1.000 Min. : 6.0 Min. :44.00
## Class :character 1st Qu.: 2.000 1st Qu.:153.0 1st Qu.:77.00
## Mode :character Median : 4.000 Median :157.0 Median :80.00
## Mean : 4.119 Mean :150.2 Mean :78.43
## 3rd Qu.: 6.000 3rd Qu.:162.0 3rd Qu.:81.00
## Max. :13.000 Max. :165.0 Max. :84.00
## NA's :399
## W L DivWin WCWin
## Min. : 0.00 Min. : 4.00 Length:2775 Length:2775
## 1st Qu.: 66.00 1st Qu.: 65.00 Class :character Class :character
## Median : 77.00 Median : 76.00 Mode :character Mode :character
## Mean : 74.68 Mean : 74.68
## 3rd Qu.: 87.00 3rd Qu.: 86.50
## Max. :116.00 Max. :134.00
##
## LgWin WSWin R AB
## Length:2775 Length:2775 Min. : 24.0 Min. : 211
## Class :character Class :character 1st Qu.: 612.0 1st Qu.:5124
## Mode :character Mode :character Median : 690.0 Median :5384
## Mean : 681.9 Mean :5138
## 3rd Qu.: 764.0 3rd Qu.:5516
## Max. :1220.0 Max. :5781
##
## H X2B X3B HR
## Min. : 33 Min. : 3.0 Min. : 0.00 Min. : 0.0
## 1st Qu.:1298 1st Qu.:193.0 1st Qu.: 31.00 1st Qu.: 41.0
## Median :1393 Median :230.0 Median : 41.00 Median :106.0
## Mean :1346 Mean :227.1 Mean : 47.27 Mean :100.5
## 3rd Qu.:1468 3rd Qu.:270.0 3rd Qu.: 60.00 3rd Qu.:148.0
## Max. :1783 Max. :376.0 Max. :150.00 Max. :264.0
##
## BB SO SB CS
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.00
## 1st Qu.:424.5 1st Qu.: 503.0 1st Qu.: 64.0 1st Qu.: 35.00
## Median :493.0 Median : 750.0 Median : 96.0 Median : 46.00
## Mean :473.7 Mean : 732.2 Mean :112.6 Mean : 48.97
## 3rd Qu.:554.0 3rd Qu.: 962.0 3rd Qu.:142.0 3rd Qu.: 58.00
## Max. :835.0 Max. :1535.0 Max. :581.0 Max. :191.00
## NA's :120 NA's :144 NA's :859
## HBP SF RA ER
## Min. : 26.00 Min. :24.00 Min. : 34.0 Min. : 25.0
## 1st Qu.: 47.00 1st Qu.:39.00 1st Qu.: 608.5 1st Qu.: 499.0
## Median : 54.50 Median :44.00 Median : 688.0 Median : 590.0
## Mean : 56.28 Mean :44.92 Mean : 681.9 Mean : 570.2
## 3rd Qu.: 64.00 3rd Qu.:50.00 3rd Qu.: 764.0 3rd Qu.: 666.0
## Max. :103.00 Max. :75.00 Max. :1252.0 Max. :1023.0
## NA's :2325 NA's :2325
## ERA CG SHO SV
## Min. :1.220 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.:3.335 1st Qu.: 15.00 1st Qu.: 6.000 1st Qu.: 9.00
## Median :3.820 Median : 46.00 Median : 9.000 Median :24.00
## Mean :3.813 Mean : 50.99 Mean : 9.646 Mean :23.46
## 3rd Qu.:4.300 3rd Qu.: 78.00 3rd Qu.:13.000 3rd Qu.:38.00
## Max. :8.000 Max. :148.00 Max. :32.000 Max. :68.00
##
## IPouts HA HRA BBA
## Min. : 162 Min. : 49 Min. : 0.0 Min. : 0.0
## 1st Qu.:4074 1st Qu.:1288 1st Qu.: 44.5 1st Qu.:427.0
## Median :4230 Median :1392 Median :108.0 Median :495.0
## Mean :4019 Mean :1345 Mean :100.5 Mean :474.1
## 3rd Qu.:4341 3rd Qu.:1470 3rd Qu.:147.0 3rd Qu.:556.0
## Max. :4518 Max. :1993 Max. :241.0 Max. :827.0
##
## SOA E DP FP
## Min. : 0.0 Min. : 47.0 Min. : 18.0 Min. :0.7600
## 1st Qu.: 500.0 1st Qu.:117.0 1st Qu.:126.0 1st Qu.:0.9600
## Median : 729.0 Median :146.0 Median :145.0 Median :0.9700
## Mean : 725.6 Mean :187.3 Mean :140.1 Mean :0.9613
## 3rd Qu.: 958.5 3rd Qu.:218.0 3rd Qu.:159.8 3rd Qu.:0.9800
## Max. :1450.0 Max. :639.0 Max. :217.0 Max. :0.9910
## NA's :317
## name park attendance BPF
## Length:2775 Length:2775 Min. : 6088 Min. : 60.0
## Class :character Class :character 1st Qu.: 524845 1st Qu.: 97.0
## Mode :character Mode :character Median :1125288 Median :100.0
## Mean :1330952 Mean :100.2
## 3rd Qu.:1984406 3rd Qu.:103.0
## Max. :4483350 Max. :129.0
## NA's :279
## PPF teamIDBR teamIDlahman45 teamIDretro
## Min. : 60.0 Length:2775 Length:2775 Length:2775
## 1st Qu.: 97.0 Class :character Class :character Class :character
## Median :100.0 Mode :character Mode :character Mode :character
## Mean :100.2
## 3rd Qu.:103.0
## Max. :141.0
##
##
## > file_path <- "TeamsFranchises.reformatted.csv"
##
## > df6 <- read.csv(file_path, stringsAsFactors = FALSE)
##
## > summary(df6)
## franchID franchName active
## Length:120 Length:120 Length:120
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## NAassoc
## Length:120
## Class :character
## Mode :character
This scatterplot was created to compare the number of ballots with the number of votes per year. The number of ballots are the number people who voted for the Hall of Fame during that specific year and the number of votes are the number of votes that a person has gained to be considered for the Hall of Fame. The green squares are the ballots for that specific year and the blue circles are specific people that have gained a certain number of votes. Also the color of the green squares are homeruns where the lighter the color the less number of homeruns and the darker the color the more number of homeruns. Furthermore, the color of the blue circles are the number of hits where the lighter the color the less number of hits the person had and the darker the color the more number of hits the person had. The reason that this visual is important is because it shows a positive correlation for both scatterplots and the color of the points get darker as the years go by which means that more people are voting for the Hall of Fame every year and players get more hits and homeruns than the year before. Finally, this shows that as the players get more hits and homeruns then more people are likely to vote for Hall of Fame.
Bar Chart 1: This barchart was created to look at the homerun average per hit of each team that was at bat at least 150k times. The bars were colored using a key performance indicator showing that any team who hit homeruns 10% or greater when getting a hit are blue while 5% to 10% are green and anything percent lower is orange. This shows that any team who were at bat higher than the average number of times at bat (386k) are only between 5% and 10% likely of getting a homerun when getting a hit. Furthermore, most of those below the average are either 10% and higher or less than 5% which is interesting how teams who had similar number of times at bat have a huge difference in number of homeruns.
Scatterplot 2: This scatterplot was created to compare the number of votes needed to be selected for the Hall of Fame with the number of games played per year. These two scatterplots show a positive correlation which is interesting because it introduces the possibiltly that as the number of games played over the years increase then the number of votes needed to be selected for the Hall of Fame also increase.
This visualization depicts the batting averages for each active team from 1950-2014 via boxplot. Each page details the average, median, upper and lower quartiles, as well as the outliars of each team’s batting average. This plot is interesting because it tells whether these values remained the same or fluctuated for each team as time passed by. For further analysis, this plot could be blended with the Teams csv in order to determine if the batting average for each team could be related to that team’s winning record (wins/games played).
Salary: In the csv for franchise we have a dataset of how much each franchise manages to spend on salary. This visualization was created to be able to compare how much of a correlation there is between salary spent and a teams winning ratio, a calculated field created from dividing the number of wins by total games. From this bar chart we can see that there is a strong correlation between them, which could mean several things. A casual relationship where the more money spent on a team the more likely they are t win, an effect relationship which means that because they win so much they therefore have the money to spend on salary. Or neither which would mean that both of them are being caused by a third unknown factor. But one thing is for certain is that they are both related.
Hall of Fame Geographic: In this visualization we obtained every baseball hall of famer and marked down they’re hometown on the map. Once there we color coded them by the persons category (ie. player, umpire, manager, etc). In the detail sheet we added the year they were added along with the name of the person. Something that can be observed about this graph is that it is very easy to see which areas of the United States have more often than not been prone to creating a baseball hall of Famer. One of the most notable observations is that the New England states have a much higher density in Baseball hall of famers than any other region, particularly the Midwest with little to now baseball legends.
Using the data from the franchise csv we are able to create a crosstab comparing the franchises over the course of several years with their winning ration, the calculated field mentioned earlier. Blending the franchise and team csv did this. In the scatter plot we are able to see whether a team had a winning record (>0.6) or a losing record (<0.4), color-coding it for ease of visibility. An observation we can make is many teams often got into streaks for several years. Either winning streaks or losing streaks would keep going, however this trend diminished by a big amount once more teams started being added to the league. One could say with increased competition the sport became harder to get on a certain type of streak.